Welcome to Jupyter

First things first, let's get some terminology straight.

The language we're working in is Python 3.7.
The editor we're using is Google Colab. The code runs on Google's servers, and the results are shown in our browser.
The specific notebook we're looking at now is an interactive Python notebook, a .ipynb file. These are also known as Jupyter notebooks.

Jupyter notebooks have a few special properties that make them ideal for working with data:

  • Code is organized into cells, which can be code or markdown
  • We can run the cells in any order. Try it out!
  • The value of the last expression in a cell is displayed automatically, so there's no need to wrap it with print()
In [1]:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
In [2]:
print(x) # Run this cell after running the one above, and again after running the one below
Answer to the Ultimate Question of Life, the Universe, and Everything
In [3]:
x = 42
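
Because all cells share one set of variables, what cell 2 prints depends on which assignment to x ran most recently. Here is a minimal sketch of that behaviour in plain Python. The `cells` dictionary and the `run` helper are hypothetical stand-ins for the notebook, not part of Jupyter itself:

```python
# Hypothetical stand-in for a notebook: each string is the source of one cell.
cells = {
    1: "x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'",
    2: "shown = x",  # stands in for print(x)
    3: "x = 42",
}


def run(order):
    """Run the numbered cells in the given order against one shared namespace."""
    namespace = {}
    for number in order:
        exec(cells[number], namespace)
    return namespace["shown"]


top_to_bottom = run([1, 2, 3])  # cell 2 runs before x is reassigned
out_of_order = run([1, 3, 2])   # cell 2 runs after x is reassigned

print(top_to_bottom)
print(out_of_order)
```

The takeaway is that a notebook remembers the order in which cells were run, not the order in which they appear on the page.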

Importing packages

We use the pandas package to easily work with data as tables.
The numpy package allows us to work with some other special data types, like missing values

We'll rename these as pd and np, just so they're easier to refer to later on.

In [4]:
# as allows us to rename the packages
import pandas as pd
import numpy as np
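
To make the "missing values" point concrete, here is a small sketch with made-up numbers (the `diameters` series is an assumption for illustration, not part of the trees data). It shows how numpy's NaN marks a gap and how pandas works around it:

```python
import numpy as np
import pandas as pd

# Made-up diameters; np.nan marks a measurement that was never recorded.
diameters = pd.Series([6.0, np.nan, 14.0])

n_missing = int(diameters.isna().sum())  # count the missing entries
mean_dbh = diameters.mean()              # pandas skips NaN by default

print(n_missing, mean_dbh)
```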

Importing data

For this semester, we'll typically work with data in tabular format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a .csv file ending, short for comma separated values.

To import this, let's use the pd.read_csv() function:

In [5]:
# Replace w/ URL
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/trees.csv'
trees = pd.read_csv(url)

Here, we've saved the data to a dataframe object named trees
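
If the URL isn't reachable, the same function can be tried on a few lines of text. The sketch below uses made-up rows (the `csv_text` content is an assumption, not taken from trees.csv) to show what "comma separated values" looks like and how pd.read_csv turns it into a table:

```python
import io

import pandas as pd

# Made-up rows in the same comma-separated layout as a .csv file.
csv_text = """tree_id,common_name,dbh
1,Southern Magnolia,6.0
2,Brisbane Box,
3,Cherry Plum,14.0
"""

# pd.read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.dbh.isna().sum())  # the empty field is read as a missing value
```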

In [6]:
trees.shape
Out[6]:
(36073, 13)
In [7]:
type(trees)
Out[7]:
pandas.core.frame.DataFrame

Exploring dataframes

Let's take a look at the data. We'll use the methods .head() and .sample(). The .tail() method works like .head(), but counts from the end of the table.

In [8]:
trees.head()

trees.head(10) # Show 10

trees.sample(3) # Choose 3 randomly
Out[8]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
16081 48910 DPW Maintained Private 6.0 NaN Magnolia grandiflora Southern Magnolia 2002-11-22 Sidewalk Cutout 37.796043 -122.428178 1885 Vallejo St
5702 68936 DPW Maintained Private 14.0 NaN Tristaniopsis laurina 'Elegant' Small-leaf Tristania 'Elegant' 1993-12-29 Sidewalk Cutout 37.779271 -122.469447 478 11th Ave
23392 90628 DPW Maintained DPW 3.0 3x3 Arbutus 'Marina' Hybrid Strawberry Tree 2008-04-07 Sidewalk Cutout 37.774210 -122.491042 3000X Cabrillo St
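
As a quick sketch of how these three methods differ, here they are on a made-up table of 20 rows (the `toy` dataframe is an assumption for illustration). Passing random_state to .sample() is optional; it fixes the seed so the same rows come back each time:

```python
import pandas as pd

toy = pd.DataFrame({"tree_id": range(1, 21)})  # 20 made-up rows

first_five = toy.head()   # default is 5 rows
last_three = toy.tail(3)  # rows counted from the end

# random_state fixes the seed, so both calls pick the same rows.
picked = toy.sample(3, random_state=0)
picked_again = toy.sample(3, random_state=0)

print(first_five)
print(last_three)
print(picked)
```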

How big is the dataset? .shape returns a tuple with the dimensions as (rows, columns)

In [9]:
trees.shape
Out[9]:
(36073, 13)

Let's take a look at some of the values in the dataset.
- What are the different caretaker types?
- How many unique tree species are there in the dataset?

In [10]:
trees.species_name.nunique()
Out[10]:
367
In [11]:
trees.caretaker.unique()
Out[11]:
array(['Private', 'DPW', 'SFUSD', 'Dept of Real Estate', 'Health Dept',
       'Rec/Park', 'Purchasing Dept', 'Port', 'War Memorial',
       'Arts Commission', 'Office of Mayor', 'Police Dept', 'PUC',
       'Public Library', 'MTA', 'Fire Dept', 'DPW for City Agency',
       'Mayor Office of Housing', 'Housing Authority', 'City College'],
      dtype=object)
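
One detail worth knowing: the two methods treat missing values differently. A small sketch on a made-up column (the `toy` dataframe is an assumption, not the trees data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"caretaker": ["Private", "DPW", "Private", np.nan]})

labels = toy.caretaker.unique()     # distinct values, NaN included
n_labels = toy.caretaker.nunique()  # NaN is not counted by default
n_with_nan = toy.caretaker.nunique(dropna=False)

print(labels)
print(n_labels, n_with_nan)
```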

Which tree shows up the most frequently?

In [12]:
trees.common_name.value_counts()
Out[12]:
Swamp Myrtle              2781
Brisbane Box              2751
Hybrid Strawberry Tree    1968
Victorian Box             1604
Southern Magnolia         1602
                          ... 
Katsura tree                 1
Umbrella tree                1
Cabada palm                  1
Pindo Palm                   1
African Sumac                1
Name: common_name, Length: 365, dtype: int64
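
The result of .value_counts() is sorted with the most frequent value first, so the answer is in the top row. The sketch below, on a made-up list of names, shows two ways to build on it: pulling out just the top label, and asking for fractions instead of counts:

```python
import pandas as pd

# Made-up names, standing in for the common_name column.
names = pd.Series(["Swamp Myrtle", "Brisbane Box", "Swamp Myrtle", "Cherry Plum"])

counts = names.value_counts()                # sorted, most frequent first
most_common = counts.idxmax()                # label with the highest count
shares = names.value_counts(normalize=True)  # fractions instead of counts

print(counts)
print(most_common)
print(shares)
```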

What are the biggest trees?
Note: DBH stands for diameter at breast height, the diameter of the trunk measured at about chest height on a standing tree.

In [13]:
trees.sort_values(by='dbh', ascending=False).head()
Out[13]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
34738 14513 DPW Maintained DPW 100.0 4X4 Fraxinus uhdei Shamel Ash: Evergreen Ash 2018-06-18 Sidewalk Cutout 37.776560 -122.446728 501 Masonic Ave
28183 12738 DPW Maintained DPW 100.0 4x4 Tristaniopsis laurina 'Elegant' Small-leaf Tristania 'Elegant' 2013-07-12 Sidewalk Cutout 37.786183 -122.477196 1630 Lake St
5025 4768 DPW Maintained DPW 100.0 3X3 Corymbia ficifolia Red Flowering Gum 1993-01-05 Sidewalk Cutout 37.732715 -122.385231 26 Commer Ct
17964 24961 DPW Maintained DPW 90.0 20 Phoenix canariensis Canary Island Date Palm 2005-04-21 Median Cutout 37.767709 -122.426675 100 Dolores St
5581 13104 DPW Maintained DPW 90.0 3X3 Ficus retusa nitida Banyan Fig 1993-10-26 Sidewalk Cutout 37.801143 -122.426724 1530 Lombard St
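
Sorting and then taking the top rows is one way to do this. pandas also has .nlargest(), which does both steps at once. A small sketch on made-up rows (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "common_name": ["Banyan Fig", "Cherry Plum", "Red Flowering Gum", "Mayten"],
    "dbh": [90.0, 3.0, 100.0, 12.0],
})

# Sort by diameter, largest first, then keep the top two rows.
by_sort = toy.sort_values(by="dbh", ascending=False).head(2)

# Same idea in one call.
by_nlargest = toy.nlargest(2, "dbh")

print(by_sort)
print(by_nlargest)
```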

Subsetting

Subsetting is a super helpful tool. We'll take a look at this in more depth next week, but for now, here are the basics:

Let's take a look at just the species name, common name, and address. We can place these column names into a list, then subset the original dataframe by that list.

In [14]:
cols = ['species_name', 'common_name', 'address']
trees_subset = trees[cols]
# Same thing as trees[['species_name', 'common_name', 'address']]
trees_subset.head()
Out[14]:
   species_name           common_name             address
0  Pittosporum undulatum  Victorian Box           501 Arkansas St
1  Magnolia grandiflora   Southern Magnolia       2828 Divisadero St
2  Ginkgo biloba          Maidenhair Tree         601 29th St
3  Ginkgo biloba          Maidenhair Tree         601 29th St
4  Arbutus 'Marina'       Hybrid Strawberry Tree  601 29th St
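
The double brackets matter. A sketch on a made-up two-row table (the `toy` dataframe is an assumption) shows that a single column name gives back a Series, while a list of names gives back a dataframe, even if the list has only one name in it:

```python
import pandas as pd

toy = pd.DataFrame({
    "species_name": ["Ginkgo biloba", "Magnolia grandiflora"],
    "common_name": ["Maidenhair Tree", "Southern Magnolia"],
    "address": ["601 29th St", "2828 Divisadero St"],
})

one_column = toy["common_name"]  # single brackets: a Series
as_table = toy[["common_name"]]  # a list inside the brackets: a DataFrame
two_columns = toy[["species_name", "address"]]

print(type(one_column))
print(type(as_table))
print(two_columns.shape)
```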

We can filter rows from a dataframe based on some condition

- Show only trees north of Golden Gate Park (latitude > 37.77285)
- Show only Cherry Plum trees
- Show only trees in front, back, and side yards

In [15]:
trees[trees.latitude > 37.77285]
Out[15]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
1 30321 DPW Maintained Private 2.0 NaN Magnolia grandiflora Southern Magnolia 1956-01-06 Sidewalk Cutout 37.795718 -122.441860 2828 Divisadero St
5 30339 DPW Maintained Private 11.0 NaN Platanus x hispanica Sycamore: London Plane 1956-02-15 Sidewalk Cutout 37.793189 -122.441380 2560 Divisadero St
6 30337 DPW Maintained Private 12.0 NaN Platanus x hispanica Sycamore: London Plane 1956-02-15 Sidewalk Cutout 37.793242 -122.441395 2560 Divisadero St
7 30341 DPW Maintained Private 10.0 NaN Acacia melanoxylon Blackwood Acacia 1956-02-15 Sidewalk Cutout 37.805913 -122.437521 3789 Fillmore St
20 30418 DPW Maintained Private 12.0 NaN Platanus x hispanica Sycamore: London Plane 1956-03-26 Sidewalk Cutout 37.797295 -122.440879 2509 Filbert St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
36068 144227 DPW Maintained Private 0.0 Width 4ft Agonis flexuosa Peppermint Willow 2020-01-25 Sidewalk Cutout 37.773933 -122.503557 782 43rd Ave
36069 144230 DPW Maintained Private 0.0 Width 4ft Melaleuca quinquenervia Cajeput 2020-01-25 Sidewalk Cutout 37.775598 -122.503676 696 43rd Ave
36070 261517 DPW Maintained Private 3.0 Width 3ft Agonis flexuosa Peppermint Willow 2020-01-25 Sidewalk Yard 37.775886 -122.501730 679 41st Ave
36071 144157 DPW Maintained Private 0.0 Width 4ft Tristaniopsis laurina Swamp Myrtle 2020-01-25 Sidewalk Cutout 37.774642 -122.501452 746 41st Ave
36072 144192 DPW Maintained Private 0.0 Width 4ft Lophostemon confertus Brisbane Box 2020-01-25 Sidewalk Cutout 37.776940 -122.502697 618 42nd Ave

15811 rows × 13 columns

In [16]:
trees[trees.site_location.isin(['Front Yard','Side Yard','Back Yard'])]
Out[16]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
4369 9448 DPW Maintained Rec/Park 12.0 3X3 Maytenus boaria Mayten 1992-05-18 Front Yard Yard 37.780637 -122.424279 800X Golden Gate Ave
5889 9445 DPW Maintained Rec/Park 12.0 3X3 Maytenus boaria Mayten 1994-03-01 Front Yard Yard 37.780677 -122.423969 800X Golden Gate Ave
14611 241180 DPW Maintained Private 7.0 Width 3ft Trachycarpus fortunei Windmill Palm 2001-02-20 Front Yard Yard 37.769557 -122.426830 300 Duboce Ave
14613 241179 DPW Maintained Private 7.0 Width 3ft Trachycarpus fortunei Windmill Palm 2001-02-20 Front Yard Yard 37.769562 -122.426784 300 Duboce Ave
14618 241181 DPW Maintained Private 8.0 Width 3ft Trachycarpus fortunei Windmill Palm 2001-02-20 Front Yard Yard 37.769559 -122.426874 300 Duboce Ave
... ... ... ... ... ... ... ... ... ... ... ... ... ...
35808 180799 Permitted Site Private 3.0 8x8 Eriobotrya deflexa Bronze Loquat 2019-10-23 Front Yard Yard 37.717753 -122.446970 315 Mount Vernon Ave
35883 261055 Private Private 3.0 3x3 Afrocarpus gracilior Fern Pine 2019-11-20 Front Yard Cutout 37.736072 -122.396149 1948 Quesada Ave
35884 261056 Private Private 3.0 3x3 Afrocarpus gracilior Fern Pine 2019-11-20 Front Yard Cutout 37.736110 -122.396217 1952 Quesada Ave
35891 259322 Significant Tree Private 3.0 3x3 Lophostemon confertus Brisbane Box 2019-11-25 Front Yard Yard 37.755691 -122.476717 1721 19th Ave
35961 261252 Significant Tree Private 3.0 NaN Banksia integrifolia Coast Banksia 2019-12-18 Front Yard Cutout NaN NaN 100 Collins St

201 rows × 13 columns

In [17]:
trees[trees.common_name == 'Cherry Plum']
Out[17]:
tree_id legal_status caretaker dbh plot_size species_name common_name date site_location site_type latitude longitude address
149 53700 Permitted Site Private 14.0 NaN Prunus cerasifera Cherry Plum 1970-03-04 Sidewalk Cutout 37.746081 -122.426025 263 Duncan St
198 54020 DPW Maintained Private 13.0 NaN Prunus cerasifera Cherry Plum 1972-04-07 Sidewalk Cutout 37.772780 -122.494875 862 35th Ave
208 54057 DPW Maintained Private 8.0 NaN Prunus cerasifera Cherry Plum 1972-04-21 Sidewalk Cutout 37.772551 -122.494860 874 35th Ave
265 54255 Permitted Site Private 10.0 3x3 Prunus cerasifera Cherry Plum 1972-07-03 Sidewalk Cutout 37.759509 -122.442802 191 Caselli Ave
364 221734 DPW Maintained Private 12.0 Width 4ft Prunus cerasifera Cherry Plum 1972-08-17 Sidewalk Cutout 37.765292 -122.452934 203 Carl St
... ... ... ... ... ... ... ... ... ... ... ... ... ...
35535 55973 DPW Maintained Private 3.0 NaN Prunus cerasifera Cherry Plum 2019-06-10 Sidewalk Cutout 37.791259 -122.432719 2221 Webster St
35571 236272 DPW Maintained Private 3.0 Width 3ft Prunus cerasifera Cherry Plum 2019-07-26 Sidewalk Cutout 37.766989 -122.416495 99 Shotwell St
35572 236271 DPW Maintained Private 3.0 Width 3ft Prunus cerasifera Cherry Plum 2019-07-26 Sidewalk Cutout 37.767032 -122.416501 99 Shotwell St
35700 246210 DPW Maintained Private 3.0 Width 0ft Prunus cerasifera Cherry Plum 2019-10-01 Sidewalk Cutout 37.767967 -122.443800 725 Buena Vista Ave West
35701 246211 DPW Maintained Private 3.0 Width 0ft Prunus cerasifera Cherry Plum 2019-10-01 Sidewalk Cutout 37.767917 -122.443821 725 Buena Vista Ave West

1180 rows × 13 columns
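
Each filter above is a column of True/False values, one per row, and these can be combined. The sketch below uses made-up rows (the `toy` dataframe is an assumption) to show & for "and", | for "or", and ~ for "not". Note that pandas uses these symbols rather than the Python keywords, and that each condition needs its own parentheses when written inline:

```python
import pandas as pd

toy = pd.DataFrame({
    "common_name": ["Cherry Plum", "Cherry Plum", "Mayten", "Brisbane Box"],
    "latitude": [37.7911, 37.7461, 37.7806, 37.7769],
    "site_location": ["Sidewalk", "Sidewalk", "Front Yard", "Sidewalk"],
})

# Each of these is a True/False value per row.
is_north = toy.latitude > 37.77285
is_cherry_plum = toy.common_name == "Cherry Plum"
in_yard = toy.site_location.isin(["Front Yard", "Side Yard", "Back Yard"])

north_cherry_plums = toy[is_north & is_cherry_plum]  # both must hold
cherry_or_yard = toy[is_cherry_plum | in_yard]       # either may hold
not_in_yard = toy[~in_yard]                          # negation

# Inline form of the first filter, with parentheses around each condition.
same_thing = toy[(toy.latitude > 37.77285) & (toy.common_name == "Cherry Plum")]

print(north_cherry_plums)
```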

Visualization

First things first, let's import the package to help us visualize the data, plotly.

If this package isn't yet installed, we can install it using !pip install plotly. More on this in week 5.

In [18]:
import plotly.express as px

## Run the following if graphs don't show
# import plotly.io as pio
# pio.renderers.default='notebook'

Note that we're using a subpackage of the broader plotly package, called plotly express. It simplifies a lot of the more difficult steps.

Plotly express has a broad range of options to play with, so let's take a look at the documentation.
Do a quick Google search to pull up the documentation for px.scatter, or run px.scatter? in a Jupyter cell.

In [19]:
px.scatter?
In [20]:
fig = px.scatter(trees.sample(frac=.1), x='date', y='dbh')
fig.show('notebook')

There aren't any obvious trends in this view. Let's add some more parameters.

In [22]:
trees_sample = trees.sample(frac=.2)
In [23]:
fig = px.scatter(trees_sample, x='date', y='dbh', 
                 opacity=.15, color='site_location', 
                 hover_name='common_name', hover_data=['site_location','site_type','address'],
                 marginal_x = 'histogram', marginal_y = 'histogram',
                 color_discrete_sequence = px.colors.qualitative.Prism[4:]
                )
fig.show('notebook')

Geographic Plots

The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.
Is there a general area in which there are more roadside / median trees?

In [24]:
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude', 
                        color='site_location', size='dbh', opacity=.4,
                        color_discrete_sequence=px.colors.qualitative.Prism[4:],
                        hover_name='address',hover_data=['common_name','site_location','caretaker'],
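                        # "open-street-map" needs no access token; the original
                        # "stamen-terrain" tiles may no longer load without a Stadia Maps key
                        zoom=11, mapbox_style="open-street-map",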
                       )
fig.show('notebook')
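
The map answers the question by eye. As a rough numerical cross-check, one option is to keep only the median trees and count them per small grid cell. This is a sketch on made-up coordinates (the `toy` dataframe and the choice of rounding to two decimal places are assumptions, not part of the original analysis):

```python
import pandas as pd

toy = pd.DataFrame({
    "site_location": ["Median", "Median", "Sidewalk", "Median"],
    "latitude": [37.7677, 37.7681, 37.7741, 37.7329],
    "longitude": [-122.4267, -122.4263, -122.4910, -122.3852],
})

# Keep only trees on a road median.
medians = toy[toy.site_location == "Median"]

# Rounding to 2 decimal places groups points into cells roughly 1 km across.
cell_counts = (
    medians
    .assign(lat_cell=medians.latitude.round(2),
            lon_cell=medians.longitude.round(2))
    .groupby(["lat_cell", "lon_cell"])
    .size()
    .sort_values(ascending=False)
)

print(cell_counts)  # cells with the most median trees come first
```

Swapping `toy` for `trees` would run the same count on the full dataset.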